Depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other tasks. It is therefore crucial to have efficient and accurate depth estimation models that run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single-image depth estimation solutions that achieve real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset collected with the ZED stereo camera, which is capable of generating depth maps for objects located up to 50 meters away. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA-resolution depth maps at up to 27 FPS while achieving high-fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile device; a detailed description of each is provided in this paper.
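The challenge ranks solutions by on-device throughput. A minimal sketch of how such an FPS measurement might look (the `dummy_infer` stand-in and all names here are hypothetical illustrations, not the challenge's actual evaluation harness):

```python
import time

def measure_fps(infer, n_warmup=3, n_runs=10):
    """Measure average frames per second of an inference callable.

    `infer` is any zero-argument function that runs one forward pass;
    warm-up runs are excluded so one-time setup costs do not skew
    the measurement.
    """
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

# Hypothetical stand-in for a real depth model's forward pass.
def dummy_infer():
    time.sleep(0.001)

fps = measure_fps(dummy_infer)
```

Averaging over several runs after a warm-up phase is the usual way to get stable on-device numbers, since the first inference typically pays one-off allocation and cache-population costs.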
Vision-and-language navigation requires an agent to follow natural-language instructions to reach a specified goal. The large gap between seen and unseen environments makes it challenging for the agent to generalize well. Prior work has proposed data augmentation methods to explicitly or implicitly mitigate data bias and improve generalization, but these methods tend to memorize the augmented trajectories and ignore the distribution shift in unseen environments at test time. In this paper, we propose unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS), which generalizes to unseen environments by encouraging test-time visual consistency. Specifically, we design: 1) a semi-supervised framework, DAVIS, that exploits visual consistency signals from similar semantic observations; and 2) a two-stage learning procedure that encourages adaptation to the test-time distribution. The framework enhances a basic mixture of imitation and reinforcement learning with momentum-based contrastive learning to encourage stable decisions on similar observations during both the joint-training and test-time-adaptation stages. Extensive experiments show that DAVIS achieves model-agnostic improvements over previous state-of-the-art VLN baselines on the R2R and RxR benchmarks. Our source code and data are included in the supplementary materials.
Gaze estimation, which determines where a person is looking from an image of the face, is a valuable cue for understanding human intent. As in other areas of computer vision, deep learning (DL) methods have gained prominence in the gaze estimation domain. However, the gaze calibration problem remains open and prevents existing methods from further improving performance. An effective solution is to directly predict the difference information between the two eyes, as in the difference network (Diff-NN), but this solution loses accuracy when only one inference image is used. We propose a difference residual model (DRNet) combined with a new loss function to exploit the difference information of the two eye images, treating this difference information as auxiliary information. We evaluate the proposed model mainly on two public datasets: (1) MPIIGaze and (2) EyeDiap. Using only eye features, DRNet outperforms state-of-the-art gaze estimation methods with angular errors of 4.57 on MPIIGaze and 6.14 on EyeDiap. Furthermore, the experimental results also show that DRNet is robust to noisy images.
Language planning aims to decompose a complex high-level goal into simpler low-level steps. Such procedural reasoning ability is essential for applications such as household robots and virtual assistants. Although language planning is a basic skill for humans in everyday life, it remains a challenge for large language models (LLMs), which lack deep commonsense knowledge about the real world. Previous methods require either manual exemplars or annotated programs to acquire such ability from LLMs. In contrast, this paper proposes a Neuro-Symbolic Causal Language Planner (CLAP) that elicits procedural knowledge from LLMs via commonsense-infused prompting. Pre-trained knowledge in LLMs is essentially an unobserved confounder that induces spurious correlations between tasks and action plans. Through the lens of a structural causal model (SCM), we propose an effective strategy to construct prompts as causal interventions on the SCM. Using graph sampling techniques and symbolic program executors, our strategy formalizes structured causal prompts from commonsense knowledge bases. CLAP achieves state-of-the-art performance on WikiHow and RobotHow, with a 5.28% relative improvement in human evaluation under the counterfactual setting. This indicates the superiority of CLAP in semantic- and sequence-level causal language planning.
We show that a simple unsupervised masking objective can approach supervised performance on abstractive multi-document news summarization. Our method trains a state-of-the-art neural summarization model to predict the masked-out source document with the highest lexical centrality relative to the multi-document group. In experiments on the Multi-News dataset, our masked training objective yields a system that outperforms unsupervised methods and, in human evaluation, surpasses the best supervised method without access to any ground-truth summaries. In addition, we evaluate how different measures of lexical centrality, inspired by past work on extractive summarization, affect final performance.
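One way to operationalize "highest lexical centrality" is to score each document by the similarity of its bag-of-words vector to the group centroid; the paper evaluates several centrality measures, so this centroid-cosine variant is only an illustrative assumption:

```python
import math
from collections import Counter

def lexical_centrality(docs):
    """Score each document by cosine similarity between its
    bag-of-words vector and the centroid of the document group."""
    bags = [Counter(doc.lower().split()) for doc in docs]
    vocab = set().union(*bags)
    centroid = {w: sum(b[w] for b in bags) / len(bags) for w in vocab}

    def cosine(bag):
        dot = sum(bag[w] * centroid[w] for w in bag)
        nb = math.sqrt(sum(v * v for v in bag.values()))
        nc = math.sqrt(sum(v * v for v in centroid.values()))
        return dot / (nb * nc) if nb and nc else 0.0

    return [cosine(b) for b in bags]

docs = [
    "the storm hit the coast on monday",
    "a storm hit the coast causing floods",
    "officials met to discuss the budget",
]
scores = lexical_centrality(docs)
masked_idx = max(range(len(docs)), key=scores.__getitem__)  # document to mask out
```

The most central document becomes the prediction target while the remaining documents serve as the model's input, so no ground-truth summary is ever needed.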
Transformers have shown strong performance on many vision tasks. However, for person re-identification (ReID), vanilla transformers leave rich high-order feature relations underexploited and lack the local feature details needed to cope with the dramatic appearance variations of pedestrians. In this work, we propose an Omni-Relational High-Order Transformer (OH-Former) to model omni-relational features for ReID. First, to strengthen the capacity of the visual representation, instead of obtaining the attention matrix from pairwise queries and isolated keys at each spatial position, we take a further step to model high-order statistics in the style of a non-local mechanism. We share attention across the corresponding layers of each order with a prior mixing mechanism to reduce the computational cost. Then, a convolution-based local relation perception module is proposed to extract local relations and 2D positional information. The experimental results of our model are superior and promising, showing state-of-the-art performance on the Market-1501, DukeMTMC, MSMT17 and Occluded-Duke datasets.
In person re-identification (ReID) tasks, many works explore the learning of part features to improve performance over global image features. Existing methods extract part features in an explicit manner, either by using a hand-designed image division or by using keypoints obtained with external visual systems. In this work, we propose to learn Discriminative implicit Parts (DiPs), which are decoupled from explicit body parts. DiPs can therefore learn to extract any discriminative features that help distinguish identities, beyond predefined body parts (such as accessories). Moreover, we propose a novel implicit position to give a geometric interpretation for each DiP. The implicit position can also serve as a learning signal to encourage DiPs to be more position-equivariant with the identity in the image. Lastly, a set of attributes and auxiliary losses are introduced to further improve the learning of DiPs. Extensive experiments show that the proposed method achieves state-of-the-art performance on multiple person ReID benchmarks.
Node classification for graph-structured data aims to classify nodes whose labels are unknown. While studies on static graphs are prevalent, few studies have focused on dynamic graph node classification. Node classification on dynamic graphs is challenging for two reasons. First, the model needs to capture both structural and temporal information, particularly on dynamic graphs with a long history, which require large receptive fields. Second, model scalability becomes a significant concern as the size of the dynamic graph increases. To address these problems, we propose the Time Augmented Dynamic Graph Neural Network (TADGNN) framework. TADGNN consists of two modules: 1) a time augmentation module that captures the temporal evolution of nodes across time structurally, creating a time-augmented spatio-temporal graph, and 2) an information propagation module that learns the dynamic representations for each node across time using the constructed time-augmented graph. We perform node classification experiments on four dynamic graph benchmarks. Experimental results demonstrate that the TADGNN framework outperforms several static and dynamic state-of-the-art (SOTA) GNN models while demonstrating superior scalability. We also conduct theoretical and empirical analyses to validate the efficiency of the proposed method. Our code is available at https://sites.google.com/view/tadgnn.
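The abstract describes the time-augmentation step only at a high level. A toy sketch of one plausible construction (all names here are hypothetical): replicate each node once per snapshot, keep spatial edges within each snapshot, and add temporal edges linking a node's copy at time t to its copy at t+1.

```python
def build_time_augmented_graph(snapshots):
    """Given a list of edge sets (one per time step), build a graph whose
    nodes are (node, t) copies: spatial edges stay within a snapshot and
    temporal edges link each node's copy at t to its copy at t+1."""
    spatial, temporal = [], []
    nodes_at = []  # which nodes appear in each snapshot
    for t, edges in enumerate(snapshots):
        present = set()
        for u, v in edges:
            spatial.append(((u, t), (v, t)))
            present.update((u, v))
        nodes_at.append(present)
    for t in range(len(snapshots) - 1):
        for u in nodes_at[t] & nodes_at[t + 1]:
            temporal.append(((u, t), (u, t + 1)))
    return spatial, temporal

# Two snapshots of a toy dynamic graph.
snaps = [{(0, 1), (1, 2)}, {(0, 1), (2, 3)}]
spatial, temporal = build_time_augmented_graph(snaps)
```

Message passing on the combined edge set then propagates information both within and across time steps, which is one way a single static GNN could obtain the large temporal receptive field the abstract calls for.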
Background and Purpose: Colorectal cancer is a common fatal malignancy, the fourth most common cancer in men and the third most common cancer in women worldwide. Timely detection of the cancer in its early stages is essential for treating the disease. Currently, there is a lack of datasets for histopathological image segmentation of rectal cancer, which often hampers assessment accuracy when computer technology is used to aid diagnosis. Methods: This study provides a new publicly available Enteroscope Biopsy Histopathological Hematoxylin and Eosin Image Dataset for Image Segmentation Tasks (EBHI-Seg). To demonstrate the validity and extensiveness of EBHI-Seg, experimental results on it are reported for both classical machine learning methods and deep learning methods. Results: The experimental results showed that deep learning methods achieve better image segmentation performance on EBHI-Seg. The maximum Dice score for the classical machine learning methods is 0.948, while that for the deep learning methods is 0.965. Conclusion: This publicly available dataset contains 5,170 images covering six tumor differentiation stages, together with the corresponding ground-truth images. The dataset can help researchers develop new segmentation algorithms for the medical diagnosis of colorectal cancer, which can be used in clinical settings to assist doctors and patients.
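The reported Dice scores follow the standard definition, 2|A ∩ B| / (|A| + |B|). A minimal reference implementation on binary masks:

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient between two binary masks given as flat 0/1
    sequences: Dice = 2|A ∩ B| / (|A| + |B|). Returns 1.0 when both
    masks are empty, the usual convention."""
    inter = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    return 2.0 * inter / total if total else 1.0

pred = [1, 1, 0, 0, 1]
true = [1, 0, 0, 1, 1]
score = dice_score(pred, true)  # 2*2 / (3+3) = 0.666...
```

Dice weights the overlap against the sizes of both masks, so unlike plain pixel accuracy it is not dominated by the large background class typical of histopathology images.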
This paper studies how to flexibly integrate reconstructed 3D models into practical 3D modeling pipelines such as 3D scene creation and rendering. Due to the technical difficulty, one can only obtain rough 3D models (R3DMs) for most real objects using existing 3D reconstruction techniques. As a result, physically-based rendering (PBR) would render low-quality images or videos for scenes that are constructed from R3DMs. One promising solution would be representing real-world objects as Neural Fields such as NeRFs, which are able to generate photo-realistic renderings of an object under desired viewpoints. However, a drawback is that the synthesized views through Neural Fields Rendering (NFR) cannot reflect the simulated lighting details on R3DMs in PBR pipelines, especially when object interactions in the 3D scene creation cause local shadows. To solve this dilemma, we propose a lighting transfer network (LighTNet) to bridge NFR and PBR, such that they can benefit from each other. LighTNet reasons about a simplified image composition model, remedies the uneven surface issue caused by R3DMs, and is empowered by several perceptually motivated constraints and a new Lab angle loss which enhances the contrast between lighting strength and colors. Comparisons demonstrate that LighTNet is superior in synthesizing impressive lighting, and is promising in pushing NFR further in practical 3D modeling workflows. Project page: https://3d-front-future.github.io/LighTNet.
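The abstract does not give the Lab angle loss's exact form; one plausible reading, offered purely as an illustrative sketch, penalizes the angle between predicted and ground-truth pixel vectors in Lab color space, which decouples the direction of a color from its magnitude (roughly, lighting strength):

```python
import math

def lab_angle_loss(pred_lab, gt_lab):
    """Mean angle (radians) between corresponding pixel vectors in
    Lab space. Hypothetical sketch: the paper's actual formulation
    may differ. Penalizing the angle separates hue/chroma errors
    from per-channel magnitude errors that an L2 loss conflates."""
    total = 0.0
    for p, g in zip(pred_lab, gt_lab):
        dot = sum(a * b for a, b in zip(p, g))
        norm_p = math.sqrt(sum(a * a for a in p))
        norm_g = math.sqrt(sum(b * b for b in g))
        if norm_p and norm_g:
            # clamp to guard against floating-point drift outside [-1, 1]
            cos = max(-1.0, min(1.0, dot / (norm_p * norm_g)))
        else:
            cos = 1.0
        total += math.acos(cos)
    return total / len(pred_lab)

# Identical pixels give zero loss.
loss = lab_angle_loss([(50.0, 10.0, 20.0)], [(50.0, 10.0, 20.0)])
```

An angular formulation like this leaves overall brightness scaling unpenalized, which is one way a loss could "enhance the contrast between lighting strength and colors" as the abstract describes.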